You are viewing the RapidMiner Studio documentation for version 10.1.
HTML to XML (Text Processing)
Synopsis
This operator converts an HTML document into an XML/XHTML document.
Description
The HTML to XML operator takes a document in HTML format and parses it into strict XHTML, removing problems such as unclosed stand-alone tags. This is useful when an XHTML document is required or when the document must be fully valid.
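To illustrate why strict XHTML can matter downstream, here is a minimal Python sketch that checks whether a piece of markup parses as XML. It is not part of RapidMiner; the function name `is_well_formed_xml` and both sample strings are hypothetical, and the check covers only XML well-formedness, not full XHTML validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents, not taken from the RapidMiner tutorial.
# The first has an unclosed <br> and unclosed <li> elements.
loose_html = "<html><body><H1>Title</H1><br><ul><li>one<li>two</ul></body></html>"
# The second is the kind of strict markup an XML parser accepts.
strict_xhtml = (
    "<html><body><h1>Title</h1><br /><ul><li>one</li><li>two</li></ul></body></html>"
)

print(is_well_formed_xml(loose_html))    # False
print(is_well_formed_xml(strict_xhtml))  # True
```

An XML parser rejects the first string outright, which is the situation in which converting to XHTML first would help.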
Input
- document
The HTML document to be transformed.
Output
- document
The resulting XHTML document.
Tutorial Processes
Replace invalid HTML tags
In this example, we first generate an HTML document that contains many tags that do not conform to XHTML, such as an unclosed li, unclosed stand-alone tags, and <H1> instead of <h1>.
We then pass this document to the HTML to XML operator.
When we open the results, we see that the operator has replaced all invalid tags with their valid representations.
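The kind of cleanup described in this tutorial can be approximated outside RapidMiner with Python's standard `html.parser` module. The sketch below is an assumption-laden illustration, not the operator's implementation: the class `XhtmlSketch`, the function `to_xhtml_sketch`, the short `VOID_TAGS` list, and the sample input are all hypothetical. It lowercases tag names, self-closes stand-alone tags, closes an `li` left open before the next `li`, and closes anything still open at the end. Comments and doctype declarations are dropped.

```python
from html import escape
from html.parser import HTMLParser

# Assumed, incomplete list of stand-alone (void) tags for this sketch.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class XhtmlSketch(HTMLParser):
    """Rough HTML-to-XHTML cleanup; not RapidMiner's actual algorithm."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the cleaned-up markup
        self.open_tags = []  # stack of currently open non-void tags

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag names (<H1> arrives as "h1").
        if tag == "li" and self.open_tags and self.open_tags[-1] == "li":
            # A new <li> directly inside an open <li>: close the old one.
            self.out.append(f"</{self.open_tags.pop()}>")
        attr_text = "".join(
            f' {name}="{escape(value if value is not None else name)}"'
            for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{attr_text} />")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Void tags are already self-closed; stray end tags are dropped.
        if tag in VOID_TAGS or tag not in self.open_tags:
            return
        # Close any tags left open inside this one, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        # Close whatever is still open at the end of the input.
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def to_xhtml_sketch(html_text):
    parser = XhtmlSketch()
    parser.feed(html_text)
    parser.close()
    return parser.result()


sample = "<html><body><H1>Title</H1><ul><li>one<li>two</ul><br></body></html>"
print(to_xhtml_sketch(sample))
```

For the sample input, this prints `<html><body><h1>Title</h1><ul><li>one</li><li>two</li></ul><br /></body></html>`. The operator's real output may differ in details such as added declarations or whitespace.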